---
title: "Basic Graphical Displays"
output:
  flexdashboard::flex_dashboard:
    theme:
      version: 4
      bootswatch: default
      navbar-bg: "purple"
    orientation: columns
    vertical_layout: fill
    source_code: embed
---
```{r setup, include=FALSE}
library(flexdashboard)
library(tidyverse)
library(DT)
library(plotly)
Friday<- read_csv("~/Desktop/MTH 209/Black_Friday.csv")
```
Brief Overview 1
===
Column {data-width=450}
---
In this session, we use the Black Friday data, available on [Kaggle](https://www.kaggle.com/datasets/pranavuikey/black-friday-sales-eda), to study how to make the following graphical displays.
Column {.tabset data-width=550}
---
### Graphical Displays
- Categorical Data
  - Bar Chart
  - Pie Chart
- Quantitative Data
  - Histogram
  - Boxplot
  - Scatterplot
  - Line
### Common Arguments
Here is a list of common arguments:
- col: a vector of colors
- main: title for the plot
- xlim or ylim: limits for the x or y axis
- xlab or ylab: a label for the x or y axis
- font: font used for text, 1=plain, 2=bold, 3=italic, 4=bold italic
- font.axis: font used for axis
- cex.axis: font size for x and y axes
- font.lab: font for x and y labels
- cex.lab: font size for x and y labels
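As a minimal sketch of how several of these arguments combine in a single base-graphics call, the chunk below plots the built-in `mtcars` data. The dataset, the axis limits, and the `lims` helper are assumptions chosen only to keep the example self-contained; they are not part of the Black Friday analysis.

```{r args_sketch}
# Illustrative only: mtcars stands in for any pair of numeric variables
lims <- list(x = c(1, 6), y = c(10, 35))   # hypothetical axis limits

plot(mtcars$wt, mtcars$mpg,
     col = "steelblue",                     # col: point color
     main = "Common Arguments in One Call", # main: plot title
     xlim = lims$x, ylim = lims$y,          # xlim / ylim: axis limits
     xlab = "Weight (1000 lbs)",            # xlab / ylab: axis labels
     ylab = "Miles per Gallon",
     font.lab = 2, cex.lab = 1.2,           # bold, slightly larger axis labels
     font.axis = 3, cex.axis = 0.9)         # italic, slightly smaller tick labels
```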
Brief Overview 2 {data-orientation=rows}
===
Row {data-height=100}
---
In this session, we use the Black Friday data, available on [Kaggle](https://www.kaggle.com/datasets/pranavuikey/black-friday-sales-eda), to study how to make the following graphical displays.
Row {data-height=900}
---
### Graphical Displays
- Categorical Data
  - Bar Chart
  - Pie Chart
- Quantitative Data
  - Histogram
  - Boxplot
  - Scatterplot
  - Line
### Common Arguments
Here is a list of common arguments:
- col: a vector of colors
- main: title for the plot
- xlim or ylim: limits for the x or y axis
- xlab or ylab: a label for the x or y axis
- font: font used for text, 1=plain, 2=bold, 3=italic, 4=bold italic
- font.axis: font used for axis
- cex.axis: font size for x and y axes
- font.lab: font for x and y labels
- cex.lab: font size for x and y labels
Data
===
Column {data-width=550}
---
### <b><font size=4><span Style = "color:blue">First 500 Observations</span></font></b>
```{r show_table}
datatable(Friday[1:500,], rownames=FALSE, colnames= c("User ID", "Product ID", "Gender", "Age", "Occupation", "City Category", "Stay in Current City Years", "Marital Status", "Product Category 1", "Product Category 2", "Product Category 3", "Purchase"), options=list(pageLength=20))
```
Column {data-width=450}
---
### <font size= 4><span Style = "color:red">Description</span></font>
To understand customer purchase behavior across products in different categories, the retail company "ABC Private Limited", in the United Kingdom, shared a purchase summary of various customers for selected high-volume products from the last month. The data contain the following variables.
- User_ID: User ID
- Product_ID: Product ID
- Gender: Sex of User
- Age: Age in bins
- Occupation: Occupation (Masked)
- City_Category: Category of the City (A,B,C)
- Stay_In_Current_City_Years: Number of years the user has stayed in the current city
- Marital_Status: Marital Status
- Product_Category_1: Product Category (Masked)
- Product_Category_2: Product may belong to other category also (Masked)
- Product_Category_3: Product may belong to other category also (Masked)
- Purchase: Purchase Amount
```{r}
glimpse(Friday)
```
Bar Chart
===
Row {data-height=350}
---
###
A bar chart is a graphical display well suited to a general audience. Here, we study the distribution of the age groups of the company's customers who made purchases on Black Friday.
**Usage:** barplot(height,...)
A bar chart can be horizontal or vertical. Using the argument <span Style="color:orange">col</span>, we can assign a color to the bars. The argument <span Style="color:orange">main</span> can be used to set the title of the figure. Colors can also be specified with RGB color codes.
**Note:** The margins of a figure can be set using the <span Style="color:orange">mar</span> argument of the <span Style="color:blue">par()</span> function. The order of the settings is <span Style="color:orange">c(bottom, left, top, right)</span>.
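The points above can be sketched on a small example. The `age` vector below is hypothetical toy data standing in for `Friday$Age`, and the names `counts`, `bar_col`, and `op` are illustrative. The chunk builds the bar heights with `table()`, defines a color from RGB components with `rgb()`, and sets the margins with `par(mar = ...)` before restoring them.

```{r bar_sketch}
# Hypothetical toy data standing in for Friday$Age
age <- c("18-25", "26-35", "26-35", "36-45", "26-35", "18-25")
counts <- table(age)                  # bar heights: one count per age group

# RGB components on a 0-255 scale; this is the same color as "#69B3A2"
bar_col <- rgb(105, 179, 162, maxColorValue = 255)

op <- par(mar = c(5, 7, 4, 2))        # margins: bottom, left, top, right
barplot(counts, horiz = TRUE, las = 1, col = bar_col,
        main = "Toy Age Distribution", xlab = "Number of Purchases")
par(op)                               # restore the previous margins
```

Setting `horiz = TRUE` draws the bars horizontally, and `las = 1` keeps the group labels upright so they fit in the wider left margin.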
### Analysis
The bar charts show that the 26-35 age group makes the most purchases. The distribution is roughly symmetric around that group, with the youngest and oldest age groups making the fewest purchases. The 36-45 and 18-25 age groups are similar, with the second and third most purchases, but each makes only about half as many purchases as the 26-35 age group.
Row {data-height=650}
---
### **Vertical Bar Chart**
```{r bar1}
par(mgp=c(4,1,0)) # change the margin line for the axis title, axis labels and axis line
par(mar=c(5,7,4,2)) # set margin of the figure
barplot(table(Friday$Age), col= "lightblue", main= "Distribution of Purchases by Customer's Age", ylab="Number of Purchases", xlab="Age Group")
```
### **Horizontal Bar Chart**
```{r bar2}
# par() settings apply only to base graphics, so they are not needed for ggplot2
bar1 <- Friday %>%
  ggplot(aes(x = Age)) +
  geom_bar(fill = "#69b3a2") +
  coord_flip() +   # flip the axes to draw horizontal bars
  labs(title = "Distribution of Purchases by Customer's Age", y = "Number of Purchases", x = "Age Group")
ggplotly(bar1)     # convert the ggplot object to an interactive plotly figure
```
Pie Chart
===
Column {data-width=500}
---
Similarly, we can use a pie chart to study the distribution of the city category.
**Usage:** pie(x, ...)
**Tip:** Use color palette to choose colors (Google search: color scheme generator).
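As an alternative to picking hex codes by hand, a palette can be generated in R itself. The sketch below assumes R 3.6.0 or later, where `hcl.colors()` is available in base R. The counts are hypothetical toy values and the names `city_counts` and `slice_cols` are illustrative.

```{r palette_sketch}
# Hypothetical counts for three city categories (not the real data)
city_counts <- c(A = 3, B = 5, C = 4)

# One color per slice from a built-in qualitative palette
slice_cols <- hcl.colors(length(city_counts), palette = "Set 2")

pie(city_counts, col = slice_cols, main = "Toy City Categories")
legend("topright", legend = names(city_counts), fill = slice_cols, cex = 0.8)
```

Storing the colors in one vector keeps the slices and the legend in sync when the palette changes.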
### Analysis
This pie chart shows that city category B accounts for the largest percentage of customers, followed by city category C. City category A has the lowest percentage. None of the percentages is drastically larger or smaller than the others.
Column {data-width=500}
---
### Distribution of City Category
```{r pie}
H <- table(Friday$City_Category)
percent <- round(100 * H / sum(H), 1)           # calculate percentages
pie_labels <- paste0(percent, "%")              # append the % sign
pie_cols <- c("#54d2d2", "#fb6f92", "#f8aa4b")  # one color per city category
pie(H, main = "Distribution of City Category", labels = pie_labels, col = pie_cols)
legend("topright", legend = names(H), cex = 0.8, fill = pie_cols)
```
Histogram
===
Column {data-width=500}
---
###
A histogram is used when we want to study the distribution of a quantitative variable. Here, we study the distribution of customer purchase amount.
**Usage:** hist(x, ...)
```{r histogram}
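Friday %>% ggplot(aes(x = Purchase)) +
  geom_histogram(bins = 30, fill = "blue") +  # bins = 30 is the default; stating it avoids the stat_bin() message
  labs(title = "Distribution of Customer Purchase Amount", x = "Purchase Amount (British Pounds)", y = "Number of Purchases")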
```
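The usage line above refers to the base-graphics function `hist()`, while the chunk uses ggplot2. A minimal base-graphics sketch is shown here on the built-in `faithful` data, which is an assumption made so that the example runs without the Black Friday file. The value passed to `breaks` is only a suggestion to `hist()`, not an exact bin count.

```{r hist_sketch}
# faithful$waiting is a built-in stand-in for Friday$Purchase
h <- hist(faithful$waiting, breaks = 20, col = "lightblue",
          main = "Waiting Time Between Eruptions", xlab = "Minutes")

# hist() also returns the binning; each observation falls in exactly one bin
sum(h$counts) == nrow(faithful)
```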
Column {data-width=550}
---
### Analysis
This histogram is multimodal, with several distinct peaks. It is slightly right-skewed, since larger purchase amounts are less common. Consistent with that skew, the mean (around 9200) is larger than the median (around 8000).
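The link between right skew and a mean that exceeds the median can be checked on a small example. The vector `amounts` below is hypothetical toy data, not taken from the Black Friday file.

```{r skew_sketch}
# Hypothetical right-skewed sample: one large value stretches the upper tail
amounts <- c(2, 3, 3, 4, 5, 6, 20)

median(amounts)                  # 4, the middle of the seven sorted values
mean(amounts)                    # 43 / 7, about 6.14
mean(amounts) > median(amounts)  # TRUE: the large value pulls the mean upward
```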
Boxplot
===
Column {.tabset data-width=500}
---
### Boxplot 1
Here, we introduce another graphical display that can be used to study the distribution of a quantitative variable: the box-and-whisker plot (boxplot).
**Usage:** boxplot(x, ...) or boxplot(formula, ...)
```{r boxplot1}
boxplot(Friday$Purchase, xlab="Purchase Amount", ylab="British Pounds")
```
### Boxplot 2
In general, a boxplot is used when we want to compare the distribution of a quantitative variable across several groups. In the following, we study the distribution of customer purchase amount across combinations of sex and marital status.
```{r boxplot2}
boxplot(Purchase ~ Gender + Marital_Status, data=Friday, main="Distribution of Purchase by Sex and Marital Status",
xlab="Sex and Marital Status", ylab="Purchase", cex.lab=1.25, cex.axis=0.7,
names=c("Female & Single", "Male & Single", "Female & Married", "Male & Married"))
```
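The labels passed to `names` must follow the order in which `boxplot()` arranges the groups. As a minimal sketch of how to check that order, the chunk below uses the built-in `mtcars` data, with `am` and `vs` (both coded 0/1) as assumed stand-ins for `Gender` and `Marital_Status`.

```{r group_order_sketch}
# plot = FALSE returns the boxplot statistics without drawing anything
b <- boxplot(mpg ~ am + vs, data = mtcars, plot = FALSE)

# Group labels in plotting order; the first factor (am) varies fastest
b$names   # "0.0" "1.0" "0.1" "1.1"
```

By the same rule, `Purchase ~ Gender + Marital_Status` cycles through the sexes within each marital status, which is the order the custom labels need to follow.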
Column {data-width=450}
---
### Analysis of Boxplot 1
This boxplot shows that purchase amounts range from about 0 to 25000. The median is around 8000, and the IQR is about 6000, with Q1 and Q3 close to 6000 and 12000. There are many high outliers, and the median sits toward the lower part of the box, which indicates a right-skewed distribution.
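The quantities read from the boxplot can also be computed directly. The sketch below uses a hypothetical toy vector `v` rather than the purchase data. Note that `boxplot()` draws the box from hinges, which can differ slightly from the quartiles returned by `quantile()`.

```{r quartile_sketch}
# Hypothetical sample with one unusually large value
v <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

q <- quantile(v, c(0.25, 0.75))  # Q1 = 3.25 and Q3 = 7.75
IQR(v)                           # Q3 - Q1 = 4.5
boxplot.stats(v)$out             # 50 lies beyond the upper whisker
```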
### Analysis of Boxplot 2
This boxplot shows that all four combinations of sex and marital status have similar distributions. The medians are all around 8000, and every group has high outliers. The Male & Single and Male & Married groups have a slightly larger IQR than the female groups and also appear slightly right-skewed.
Scatterplot
===
Column {data-width=500}
---
When we want to study the relationship between two quantitative variables, a scatterplot can be used. Since this dataset doesn't have another quantitative variable, we use the built-in dataset <span class="orange">mtcars</span> in R and study the relationship between miles per gallon and vehicle weight.
```{r scatterplot}
plot(mpg ~ wt, data=mtcars, xlab="Weight (1000lbs)", ylab="Miles per Gallon", pch=19, col="blue")
```
Column {data-width=500}
---
### Analysis
The scatterplot shows a fairly clear negative, roughly linear relationship between weight and miles per gallon. As the weight of a car increases, its miles per gallon tends to decrease.
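The direction and strength of this relationship can be quantified as well. The chunk below is a minimal sketch: it computes the correlation coefficient and adds a least-squares line to the same scatterplot. The object names `r` and `fit` are illustrative.

```{r cor_sketch}
r <- cor(mtcars$wt, mtcars$mpg)      # about -0.87: strong negative association
fit <- lm(mpg ~ wt, data = mtcars)   # simple linear regression of mpg on weight

plot(mpg ~ wt, data = mtcars, xlab = "Weight (1000 lbs)",
     ylab = "Miles per Gallon", pch = 19, col = "blue")
abline(fit, col = "red", lwd = 2)    # fitted line; its slope is negative
```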
Line Plot
===
Column {.tabset data-width=350}
---
### Data
Since the Black Friday data are not time series data, a line plot is not appropriate for them. In the following code chunk, we create a data frame using the forecasted highest temperatures from July 13 to July 22, 2022 ([The Weather Channel](https://weather.com/)).
```{r data}
Date<- 13:22
Dayton_OH<- c(84,86,91,89,89,91,92,91,91,91)
Houston_TX<- c(100,97,96,94,94,94,93,93,92,91)
Denver_CO<- c(95,85,89,96,97,96,92,91,95,96)
Fargo_ND<- c(86,80,84,87,90,87,83,84,87,89)
df<- data.frame(Date, Dayton_OH, Houston_TX, Denver_CO, Fargo_ND)
datatable(df, rownames = FALSE, colnames=c("Date", "Dayton, OH", "Houston, TX",
                                           "Denver, CO", "Fargo, ND"))
```
### Analysis
The line chart shows that Houston has the highest forecasted temperature (100) and Fargo has the lowest (80). Overall, Houston tends to have the highest temperatures. Although Denver is higher than Houston on some days, it also drops much lower. Fargo has the lowest temperatures overall, and Dayton falls between the others. Denver and Fargo show the largest day-to-day changes, while Houston and Dayton are more stable.
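These observations can be checked numerically. The sketch below repeats the forecast values from the Data tab in a separate data frame (the name `temps_sketch` is illustrative) so that the chunk is self-contained, then summarizes each city.

```{r temp_summary_sketch}
# Same forecast values as in the Data tab, copied here so the chunk stands alone
temps_sketch <- data.frame(
  Dayton_OH  = c(84, 86, 91, 89, 89, 91, 92, 91, 91, 91),
  Houston_TX = c(100, 97, 96, 94, 94, 94, 93, 93, 92, 91),
  Denver_CO  = c(95, 85, 89, 96, 97, 96, 92, 91, 95, 96),
  Fargo_ND   = c(86, 80, 84, 87, 90, 87, 83, 84, 87, 89)
)

sapply(temps_sketch, range)                        # lowest and highest forecast per city
spread <- sapply(temps_sketch, function(t) diff(range(t)))
spread                                             # highest minus lowest per city
avg <- round(colMeans(temps_sketch), 1)
avg                                                # average forecast per city
```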
Column {data-width=650}
---
### Line Chart
```{r line1}
plot(Date, Dayton_OH, type="o", col="blue", xlab="Date in July", ylab="Highest Temperature", ylim=c(80,100))
lines(Date, Houston_TX, type="o", col="red")
lines(Date, Denver_CO, type="o", col="purple")
lines(Date, Fargo_ND, type="o", col="darkgreen")
# Add a legend
legend("bottomright",legend=c("Dayton, OH", "Houston, TX", "Denver, CO", "Fargo, ND"),
col=c("blue", "red", "purple", "darkgreen"), lty=1, pch=1)
```